The following authors
are listed in alphabetical order by last name:
William Chao |
|
will@ubcviscog.com |
Daniel Ha |
School of Interactive Arts & Technology (SIAT), |
dha1@sfu.ca |
Kevin Ho |
|
kevin@ubcviscog.com |
Linda Kaastra |
Media and Graphics Interdisciplinary Centre (MAGIC), |
lkaastra@interchange.ubc.ca |
Minjung Kim |
|
minjung@ubcviscog.com |
Andrew Wade |
|
andrewwade@ubcviscog.com |
This project
was supported by the Natural Science and Engineering Research Council of
Canada.
Student team: [ X ] YES [ ] NO
If you answered yes, name the faculty who agreed to be your
sponsor: Brian Fisher, bfisher@sfu.ca
For the VAST 2007 contest problem, our team used a variety of commercial and open source tools to support our analytic task. We did not pre-select a specific toolset to support this task, but rather evaluated and selected both specialized and generic tools along the way for answering emergent questions, generating plausible hypotheses, and supporting evidence-based argumentation. We refer to this as the bricolage approach: the application of multiple analytic methods by multiple analysts, using an assortment of specialized tools that could be combined to tackle a complex problem space.
We used a large number of different tools, consisting
primarily of those that were both easy to learn and readily available. Of
these, nine tools in particular contributed significantly to our overall
understanding of the problem and are summarized in the chart below. These tools
are described in further detail throughout Section 5, Visuals and Description
of the Analytical Process. The links lead to the websites of the software.
Tool |
Developer |
Description |
Release Used |
Open Source? |
Text and Qualitative Data Analysis |
||||
ATLAS.ti Scientific Software Development GmbH |
Qualitative data analysis tool capable of automatically coding a data set. |
5.2.9 |
No |
|
Stanford Natural Language Processing Group |
Identifies and tags every word with its part-of-speech (e.g., noun, verb). Used to identify proper nouns, and hence, names of people. |
2006-05-21 |
Yes |
|
Dutch
Linguistics, Free University of |
Text analysis tool, used to search and identify occurrences of important names or words. |
2.7 |
Yes |
|
Search Utility |
Organizers of VAST 2006 contest |
Text searching tool provided with the pre-processed VAST 2006 data set. Used to search and identify occurrences of important names or words. |
VAST 2006 |
No |
|
||||
Data Sharing and Visualizations |
||||
Multiple authors; primary author is Alexander Larsson |
Diagramming tool. |
0.96.1-7 |
Yes |
|
Multiple authors; primary contributors are Jörg Müller, Daniel Polansky, Petr Novak, Christian Foltin, and Dimitry Polivaev |
Mind-mapping software. Used to take notes. |
0.8.0 |
Yes |
|
|
Web applications that mimic the capabilities of Microsoft Word and Excel, but with an additional social and collaborative dimension. Used to collect and discuss notes as a group. |
2007 |
No |
|
AT&T |
Diagramming tool. Automatically generates relationship diagrams based on simple, tab-delimited text. |
2.12 |
Yes |
|
Microsoft Excel |
Microsoft |
Spreadsheet software. |
2003 |
No |
IHMC |
Diagramming tool. Automatically generates relationship diagrams based on simple, tab-delimited text. |
4.10 |
No |
|
Progeny Software, Inc. |
Used to make timelines. |
Trial version |
No |
Data set used: [ X ] RAW DATA SET
[ ] PRE-PROCESSED SET
Name |
Associated organization |
Involved in illegal activities? (Yes/No) |
Involved in terrorist activities? (Yes/No) |
Most relevant source files (5 MAX) |
Abu Hassan |
Global Ways, Professor Assan and His Amazing Animals |
Yes |
No |
Week-of-Mon-20031215-1.txt_91, Week-of-Mon-20040301-1.txt_75, ImportPermitsv3 BEST WORKING COPY |
Catherine Carnes |
SPOMA |
No |
No |
Chinchilla Dreamin’, Week-of-Mon-20030526-2.txt_57, Week-of-Mon-20030818.txt_23 |
Cesar Gil |
None |
Yes |
Yes |
Chinchilla Dreamin’, Week-of-Mon-20030609.txt_4, Week-of-Mon-20040705.txt_86 |
Faron Gardner |
Animal Justice League |
Yes |
Yes |
Chinchilla Dreamin’, Week-of-Mon-20030602-1.txt_66, Week-of-Mon-20030818.txt_23 |
Luella Vedric |
SPOMA |
Yes |
No |
Week-of-Mon-20030526-2.txt_57, Week-of-Mon-20031013.txt_4, Week-of-Mon-20040119-1.txt_98, Week-of-Mon-20040412-2.txt_13, Week-of-Mon-20040705.txt_83 |
Madhi Kim |
Global Ways |
Yes |
No |
Week-of-Mon-20040308.txt_109, Week-of-Mon-20040412-2.txt_13 |
Navarro Mercurio |
Global Ways |
Yes |
No |
meeting, Tropical Fish Importers |
r’Bear |
Shravaana / Shraavana |
No |
No |
Week-of-Mon-20030609.txt_7, Week-of-Mon-20040412-2.txt_13, Week-of-Mon-20040628.txt_61, Week-of-Mon-20040614.txt_94 |
Rosalind Baptista |
Unknown |
Yes |
No |
Chinchilla Dreamin’, hunt8, meeting |
|
Date |
Event description |
Most relevant source files (5 Max) |
1 |
July 18, 2003 |
Rosalind Baptista is seen poaching chinchillas. |
hunt8 |
2 |
August 15, 2003 |
Cesar Gil becomes a chinchilla farmer. |
Chinchilla Dreamin’ |
3 |
September 1, 2003 |
Cesar Gil announces that Gil Breeders is selling
chinchillas at |
Week-of-Mon-20030901-1.txt_36 |
4 |
September 22, 2003 |
Global Ways advertises their fish import service, highlighting their low rates of death-on-arrival (DOA). |
Week-of-Mon-20030922.txt_28 |
5 |
October 27, 2003 |
Letters to editor complaining about poor Global Ways
shipments are published. Fish shipping bags are noted to be covered in
noxious substance that causes numbness of hands. Global Ways blames an
inexperienced packer in |
cocaine hydro, Transport of Live Fish, Week-of-Mon-20031027.txt_57 |
6 |
December 15, 2003 |
A letter to CITES, urging the shut-down of Assan Circus, is published. Abu Hassan, owner of the circus, is accused of smuggling chimps and parrots, as well as mistreating animals. |
Week-of-Mon-20031215-1.txt_91 |
7 |
January 6, 2004 |
Fish and Wildlife Services issues an advisory to
ornamental fish merchants about contaminated fish shipping packages. Several
fish import companies located in |
Week-of-Mon-20040105-1.txt_58 |
8 |
January 20, 2004 |
The eighth annual SPOMA dinner is hosted by Luella Vedric. The performance by r’Bear, famed rapper, is not well received, despite his donation of $80,000. |
Week-of-Mon-20040119-1.txt_98 |
9 |
March 2, 2004 |
CITES-issued confiscation of Abu Hassan’s circus animals is reported. Hassan is presumed to have fled the country. |
Week-of-Mon-20040301-1.txt_75 |
10 |
March 2, 2004 |
Cesar Gil posts a Chinsurrection comic to his blog, depicting a chinchilla becoming infected with an unnamed disease. |
Chinchilla Dreamin’ |
11 |
March 13, 2004 |
Madhi Kim, the CEO of Global Ways, visits r’Bear’s wildlife preservation ranch, Shravaana. Madhi Kim is reported to own a canned hunting ranch, Wild Things. |
Week-of-Mon-20040308.txt_109 |
12 |
April, 2004 |
Navarro Mercurio (“MN”) and Rosalind Baptista (“RB”)
are photographed meeting in |
Meeting |
13 |
April 18, 2004 |
Nights of |
Week-of-Mon-20040412-2.txt_13 |
14 |
June 2, 2004 |
Cesar Gil posts a Chinsurrection comic to his blog, depicting a mass spread of illness through chinchillas. |
Chinchilla Dreamin’ |
15 |
June 20, 2004 |
r’Bear announces the arrival of over 500 new animals to Shravaana, including some short-tailed chinchillas. |
Week-of-Mon-20040614.txt_94 |
16 |
June 30, 2004 |
Cesar Gil posts a Chinsurrection comic to his blog, depicting the anticipated delivery of sick chinchillas by “Senorita Baptista.” |
Chinchilla Dreamin’ |
17 |
July 1, 2004 |
r’Bear is admitted to the hospital with monkeypox-like symptoms. |
Week-of-Mon-20040628.txt_61 |
18 |
July 7, 2004 |
Seven people in the LA region are reported to be seriously
ill with monkeypox. This is the second monkeypox outbreak in the |
Week-of-Mon-20040705.txt_83 |
19 |
July 24, 2004 |
Two people die from monkeypox. As a result of the outbreak, international animal transportation becomes more tightly regulated. Cesar Gil is wanted on suspicion of a connection with the outbreak, but is presumed to have fled the country. |
Week-of-Mon-20040705.txt_86 |
|
Location |
Description |
Most relevant source files (5 Max) |
1 |
Southern California ( |
Site of monkeypox outbreak in July 2004. Cesar Gil’s
chinchilla farm, Gil Breeders, as well as r’Bear’s wildlife preservation
ranch, Shravaana, are located in the |
Week-of-Mon-20040705.txt_83, Week-of-Mon-20040628.txt_61, Week-of-Mon-20040614.txt_94 |
2 |
|
Location of a Global Ways branch, managed by Navarro Mercurio. Tropical fish imported through this branch in autumn 2003 suffer from high death-on-arrival rates, and are shipped in packaging covered in a noxious substance. |
Week-of-Mon-20031027.txt_57, Week-of-Mon-20040105-1.txt_58, Tropical Fish Importers |
3 |
|
Native habitat of chinchillas, and hence, the site of chinchilla poaching. Rosalind Baptista is photographed hunting here. |
Chinchilla Dreamin’, hunt8 |
4 |
|
The meeting location between Navarro Mercurio (“MN”) and Rosalind Baptista (“RB”). |
Meeting |
5 |
|
Location of a Global Ways branch. Permits for Abu Hassan originate from here. |
ImportPermitsv3 BEST WORKING COPY |
In the spring of
2003, chinchillas gain popularity as the new “fad pets” in the
In particular,
the fad offended Cesar Gil, a biologist in the
The monkeypox
plot culminated in July of 2004, when seven people were reported to be ill with
monkeypox, including the megastar rapper, r’Bear. By July 24, two people had
died from monkeypox. Meanwhile, Cesar Gil was nowhere to be found, presumably
having fled the country.
The
distribution of monkeypox-infected chinchillas can be tied to a
privately-traded company called Global Ways. On the surface, Global Ways
appears to be an import-export company that specializes in the import of rare
and exotic tropical fish. However, Global Ways is also involved in animal and
cocaine smuggling operations, especially from South America and
Global Ways’
involvement with animal smuggling is closely linked with Abu Hassan, the owner
of the circus, “Professor Assan and His Amazing Animals.” Through this African
circus, Hassan obtained exotic species of parrots and chimpanzees to be
imported to the
A series of
tropical fish shipments through
Since the
suspected cocaine was trafficked through
The hypothesis
that Navarro Mercurio and Rosalind Baptista are the identities of M.N. and R.B.
is supported by r’Bear’s strange illness in July of 2004, likely a monkeypox
infection from sick chinchillas. We know that Madhi Kim and r’Bear are on
cordial terms. For instance, on March 13, 2004, Kim was invited to r’Bear’s
wildlife preservation ranch, Shravaana. In mid-April, 2004, r’Bear was invited
to the Global Ways Nights of Champagne and Tropical Fish as a “special guest”
of Kim. Furthermore, they have a mutual friend, Luella Vedric, who shares their
interests in uncommon animals. Therefore, we can postulate that some of the 500
animals r’Bear acquires in June of 2004, including the short-tailed
chinchillas, are supplied by Madhi Kim representing Global Ways. This, combined
with Rosalind Baptista’s infected chinchillas, points
to Navarro Mercurio as a likely mediator of the chinchilla
connection from Cesar Gil to Global Ways, and subsequently to r’Bear.
Unfortunately, how Gil knows of Baptista is unclear, and raises several
questions regarding the nature of their connection: Does she supply him with
chinchillas for his farm? Does he supply her with monkeypox-infected
chinchillas? If so, is she aware of the infection? Who is his informant? This
trail of infected chinchillas is studded with highly suspicious characters, and
warrants further investigation.
We also
recommend an investigation of Luella Vedric, a socialite and an outspoken
member of the Society for the Prevention of Mistreatment of Animals (SPOMA). On
the surface, Vedric appears to be, in every way, an animal rights activist,
even hosting the eighth annual SPOMA dinner in January of 2004. She is also a
long-time friend of Catherine “Collie” Carnes, the spokesperson of SPOMA, and
was reported to be helping track Abu Hassan’s circus to stop animal cruelty.
Yet, simultaneously, she openly associates with Madhi Kim, who owns a canned
hunting ranch and trades with Abu Hassan—the very man she provided information
to stop. These facts place Vedric in an awkward position between innocence and
guilt: Is Vedric genuinely attempting to stop animal abuse by getting
information about Abu Hassan through Kim? Or is she motivated by something
else?
Our analytical
process can be described in four major stages: information generation,
schematization, argumentation and schema shifting, and decision-making. Our
problem-solving approach gradually shifted from independent to collaborative
work as we progressed through the stages. The shift was deliberate and the
tools chosen along the way reflect this change.
It should be
noted that, for the purposes of discussion, we describe the analytical process
in neat, segmented phases; the actual problem-solving process was neither
discrete nor always linear.
The process of analysis began by familiarizing ourselves with the provided data set and by discovering categories of information (such as names, places, and events) in the data. Independent information discovery was encouraged at this phase in order to mitigate the potential for groupthink: a faulty, conforming style of group analysis that can lead to poor decision making. We allowed ourselves to explore the data space freely with few constraints, and discouraged each other from sharing individual findings until all team members had completed at least a single iteration of the information discovery stage. Our intent was to promote the generation of a diverse set of hypotheses. For this reason, most tools and techniques used for information discovery were selected primarily for their ability to support rapid entity extraction.
A preliminary timeline, for instance, was created by placing
news articles and images in a common file folder, and re-naming them to reflect
their date of creation [Fig.1].
This enabled us to get an immediate sense of the data space, allowing us to
approximate both the total quantity of information and the distribution of
information through time.
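The renaming step can be sketched in a few lines of Python. This is only an illustration, not the procedure the team actually followed: the function names are hypothetical, and the file's modification time is assumed to stand in for its date of creation, which the text does not specify.

```python
import shutil
from datetime import datetime
from pathlib import Path


def dated_name(path: Path, when: datetime) -> str:
    """Prefix a file name with an ISO date so that name order equals time order."""
    return f"{when:%Y-%m-%d}_{path.name}"


def build_timeline_folder(sources, target_dir):
    """Copy every source file into target_dir under a date-prefixed name.

    Assumption: the modification time is used as the file's creation date.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        when = datetime.fromtimestamp(src.stat().st_mtime)
        dest = target / dated_name(src, when)
        shutil.copy2(src, dest)  # copy2 preserves the original timestamps
        copied.append(dest.name)
    # Sorted names now read as a rough timeline in any file browser.
    return sorted(copied)
```

Because the date prefix is in year-month-day order, a plain alphabetical listing of the folder doubles as a chronological one.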
Figure 1. Windows Explorer. News articles and images are pooled together in a single folder, forming a rough timeline of events. |
Figure 2. Appended news articles in a text editor. The location of the scroll bar approximates our position in the overall timeline. |
Given the
relatively small size of the data set, some of the team members opted to glance
through most of the corpus to get a good sense of the themes. We believed that,
by becoming familiar with some of the storylines present, we would be able to
recognize and organize information around thematic categories when the
information was examined in more detail. Rather than read each file
individually, we appended all of the news articles together in a single text
file, then read it in a simple text editor. This allowed us to glance at the
content quickly while scrolling down, as well as use the location of the scroll
bar to estimate our position in the overall timeline [Fig.2]. We also tracked our advancement in an activity log, forcing ourselves
to become conscious of our progress in terms of the competition deadline.
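A minimal sketch of the appending step follows. It assumes that the article file names sort chronologically (as names of the form `Week-of-Mon-YYYYMMDD` do); the function names and the separator line are hypothetical.

```python
from pathlib import Path


def load_articles(folder, pattern="*.txt"):
    """Read every matching file in a folder as a (file name, text) pair."""
    return [
        (p.name, p.read_text(encoding="utf-8", errors="replace"))
        for p in Path(folder).glob(pattern)
    ]


def combine_articles(named_texts, separator="=" * 72):
    """Join articles into one string, ordered by file name.

    Each article is preceded by a header carrying its file name, so a reader
    scrolling through the result can tell which source they are in.
    """
    parts = []
    for name, text in sorted(named_texts):
        parts.append(f"{separator}\n{name}\n{separator}\n{text.strip()}\n")
    return "\n".join(parts)


def relative_position(combined: str, offset: int) -> float:
    """Fraction of the combined text before `offset`, a proxy for the scroll bar."""
    return offset / len(combined) if combined else 0.0
```

The `relative_position` helper mirrors the scroll-bar heuristic: a character offset halfway through the combined file corresponds roughly to the midpoint of the corpus, which matches the midpoint of the timeline only if articles are spread evenly over time.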
Generation of
entities was accomplished, in part, by the Stanford Part-Of-Speech Tagger (POSTagger), an
open-source tool developed by the Stanford Natural Language Processing Group.
It parses a text file and appends a tag to every word present, thereby
identifying its part-of-speech (e.g., noun, verb) [Fig.3]. Using the POSTagger, we were able to
isolate most of the proper nouns—and hence, names of potential interest—present
in the corpus.
Figure 3. The Stanford POSTagger. “Vedric” has been recognized as a proper noun.
We cross-referenced this name with a list of word frequencies, and concluded
that Vedric may be an important character. |
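The step from tagged text to candidate names might look like the sketch below. It post-processes tagger output rather than calling the tagger itself, and it rests on two assumptions: that each token is written as `word_TAG` (the separator is a parameter, since it may differ between releases), and that proper nouns carry the Penn Treebank tags `NNP` or `NNPS`. The function name is hypothetical.

```python
def proper_nouns(tagged_text: str, sep: str = "_"):
    """Return runs of consecutive proper-noun tokens as candidate names.

    Assumes tokens look like "word<sep>TAG" with Penn Treebank tags.
    """
    names, current = [], []
    for token in tagged_text.split():
        word, _, tag = token.rpartition(sep)
        if word and tag in {"NNP", "NNPS"}:
            current.append(word)
        else:
            # A non-proper-noun token closes the current run, if any.
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Illustrative input, not actual tagger output from the contest corpus.
sample = "Luella_NNP Vedric_NNP hosted_VBD the_DT dinner_NN in_IN January_NNP ._."
print(proper_nouns(sample))
```

Grouping adjacent proper nouns keeps multi-word names such as "Luella Vedric" together. Month names and organizations also surface, so the list still needs a manual pass.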
We found that
word frequency was a good heuristic by which to discard unwanted entities. By
limiting ourselves to words that only occur two to five times in the corpus, we
were able to extract some names that were neither extremely common, nor
extremely rare. Words that occur hundreds of times—for instance, PETA—were
deemed to be false leads. Similarly, words that only occur once in the entire
corpus—such as LaRae—were also assumed to be
irrelevant entities. It should be noted that words were not permanently
discarded in this stage; the list of names obtained from the POSTagger was merely considered a starting point for
conducting specific searches into the data set, with the understanding that
previously discarded words might, in fact, be important entities that would need to
be re-introduced to our list of entities.
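The frequency heuristic can be expressed as a simple partition. In this sketch the two-to-five band comes from the text; the function name is hypothetical, and the counts in the example are made up for illustration.

```python
from collections import Counter


def split_by_frequency(names, low=2, high=5):
    """Partition names into (kept, set_aside) by how often each occurs.

    Names occurring between `low` and `high` times (inclusive) are kept as
    leads. Everything else is set aside rather than deleted, so it can be
    re-introduced later.
    """
    counts = Counter(names)
    kept = {n: c for n, c in counts.items() if low <= c <= high}
    set_aside = {n: c for n, c in counts.items() if not low <= c <= high}
    return kept, set_aside


# Illustrative counts only: one very common word, one singleton, one mid-range name.
mentions = ["PETA"] * 200 + ["LaRae"] + ["Vedric"] * 3
kept, set_aside = split_by_frequency(mentions)
print(sorted(kept), sorted(set_aside))
```

Returning the set-aside names alongside the kept ones reflects the point made above: the filter narrows the starting list without permanently discarding anything.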
The POSTagger was used in conjunction with other text analysis
tools, such as ATLAS.ti, a commercial tool used in
the social sciences community for supporting analysis of large corpora. It was
used by one of the analysts in the early stages of information discovery to
find and extract entities. ATLAS.ti facilitates fast
searches with its auto-coding tool, which creates custom search scripts using
GREP regular expressions, then runs them against a large number of documents to
codify entities of interest for later analysis [Fig.4a,
4b, 4c]. It also
provides a visual relations editor that allows analysts to assign both known
and hypothetical relations between entities, then to export the result as a graphic
file for group discussion and argumentation [Fig.4d].
Figure 4a. ATLAS.ti’s
auto-coding feature. |
Figure 4b. Entities created through auto-coding. |
Figure 4c. Coded data. |
Figure 4d. ATLAS.ti’s network view
of entity relationships. |
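The general idea behind regular-expression auto-coding can be illustrated outside ATLAS.ti. The sketch below is not ATLAS.ti's interface or file format; it is a hypothetical stand-in in which a "codebook" maps each code label to a pattern, and every match in every document is recorded for later review.

```python
import re


def auto_code(documents, codebook):
    """Apply each pattern in the codebook to each document.

    documents: {document name: text}
    codebook:  {code label: regular expression}  (hypothetical structure)
    Returns a list of (code, document name, matched text, start offset).
    """
    compiled = {
        code: re.compile(pattern, re.IGNORECASE)
        for code, pattern in codebook.items()
    }
    hits = []
    for name, text in documents.items():
        for code, rx in compiled.items():
            for match in rx.finditer(text):
                hits.append((code, name, match.group(0), match.start()))
    return hits


# Illustrative documents and patterns, not taken from the contest data.
docs = {"a.txt": "Luella Vedric hosted the SPOMA dinner.", "b.txt": "No match here."}
codes = {"VEDRIC": r"\bvedric\b", "SPOMA": r"\bSPOMA\b"}
for hit in auto_code(docs, codes):
    print(hit)
```

Keeping the document name and offset with each hit makes it possible to jump back to the passage when a coded entity turns out to matter.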
Other text
analysis tools include TextSTAT, an open-source tool,
and Search Utility, a program provided with the VAST 2006 data set.
Searching for
strings in TextSTAT returns results in its
concordance view, which arranges search terms in context of the text
surrounding them [Fig.5a]. Double-clicking on a search result
opens the citation view, which shows an even larger excerpt surrounding the
search term [Fig.5b]. Figure 5a demonstrates an example search using the string
“chinchilla.” The uniform, vertical alignment of search results in the middle
of the screen allows a quick scan of the text snippets, and facilitates finding
articles of interest—in this case, the article relating chinchillas to the
monkeypox outbreak. Each line of the concordance view corresponds to a single
occurrence of the search term; articles containing multiple occurrences of the
search term, therefore, were listed multiple times.
Figure 5a. TextSTAT’s concordance view. A number of characters surrounding the search term are returned along with the search result, providing context. |
Figure 5b. TextSTAT’s citation view. A larger segment surrounding the search result is displayed. |
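A concordance (keyword-in-context) view of this kind is straightforward to approximate. The following sketch is not TextSTAT's implementation; the function name, the bracket markers, and the default context width are all assumptions made for illustration.

```python
import re


def concordance(documents, term, width=30):
    """Return one aligned context line per occurrence of `term`.

    documents: {document name: text}
    Each result is (document name, line), where the line holds `width`
    characters of left context, the bracketed match, and `width`
    characters of right context.
    """
    rx = re.compile(re.escape(term), re.IGNORECASE)
    lines = []
    for name, text in documents.items():
        flat = " ".join(text.split())  # collapse line breaks and repeated spaces
        for m in rx.finditer(flat):
            left = flat[max(0, m.start() - width):m.start()]
            right = flat[m.end():m.end() + width]
            # Right-align the left context so every match starts in the same column.
            lines.append((name, f"{left:>{width}}[{m.group(0)}]{right:<{width}}"))
    return lines
```

Padding the left context to a fixed width is what produces the vertical alignment described above, and a document with several occurrences contributes several lines.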
Unlike TextSTAT, Search Utility only returns one result per news
article [Fig.6]. Search Utility, therefore, allowed us to easily approximate the
number of articles related to the search string, which was difficult to
accomplish with TextSTAT. The two tools were used in
conjunction to complement each other’s features.
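The difference between the two tools comes down to counting occurrences versus counting articles. A small sketch of both counts, using a hypothetical function name and made-up documents:

```python
import re


def occurrence_and_article_counts(documents, term):
    """Return (total occurrences, number of articles containing the term).

    documents: {document name: text}
    The first number corresponds to one line per occurrence; the second to
    one result per article.
    """
    rx = re.compile(re.escape(term), re.IGNORECASE)
    per_doc = {name: len(rx.findall(text)) for name, text in documents.items()}
    occurrences = sum(per_doc.values())
    articles = sum(1 for n in per_doc.values() if n > 0)
    return occurrences, articles


# Illustrative documents only.
docs = {"a": "chinchilla chinchilla", "b": "one chinchilla", "c": "fish"}
print(occurrence_and_article_counts(docs, "chinchilla"))
```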
Interesting
results from the information discovery stage were vigilantly recorded,
primarily in the form of daily logs that marked our progress and leads. Some of
the analysts chose to depict their newest leads using FreeMind,
an open-source mind-mapping tool, in effect
visualizing the locally explored regions of the overall problem space [Fig.7a, 7b]. It provided an effective means by which to monitor the information
that had already been discovered and to identify avenues for further
research. Figure 7a shows that following up on PetSmart
and Animal Justice League (AJL) led to a connection between Cesar Gil and Faron
Gardner. Whether or not PetSmart’s chinchillas
are directly related to Cesar Gil is, however, unclear. Other leads resulted in
dead ends (e.g., Tony Jones) or exploded into large terrorist stories that did
not appear to be linked to the main plot (e.g., Chiron, thrashing of biology
lab in
Figure 7a. An example of tracking leads using FreeMind. The above image demonstrates a follow-up on the raid on PetSmart by Faron Gardner and the Animal Justice League. |
Figure 7b. Another example of tracking leads using FreeMind. |
This stage
marked the beginning of real collaborative work. Due to the independent nature
of the previous phase, each analyst had an incomplete mental representation of
the data. Differing preliminary hypotheses about the significance of entities
resulted in the usage of different representations to communicate ideas. The
tools selected in this stage, therefore, reflect individual mental
representations of the data.
In order to
facilitate the sharing of mental representations, we created an online wiki for
uploading any information relevant to solving the puzzle. Later, it also
served as an approximate written record of our earlier collaborative work.
Other initial methods of pooling data include a relationship diagram of
entities created using sticky notes [Fig. 8a,
8b], and a
timeline of events using pen and paper [Fig. 9a,
9b].
Figure 8a. Cluster diagram created
using sticky notes. |
Figure 8b. Discussion over the cluster diagram. |
Information was
aggregated using Google Docs and Spreadsheets, which allow many users to collaborate
synchronously on a single document [Fig.10a,
10b]. In order
to avoid overriding each other’s work, we reserved sections by highlighting
cells that we were currently working on.
Figure 10a. Database of entity relations in Google Spreadsheet. The information recorded is as follows: object A, object B, the strength of the evidence that supports their relation, a description of their relation, the date of the evidence, and the source of the evidence. |
Figure 10b. Database of events in Google Spreadsheet. The
information recorded is as follows: event name, start date and time, end date
and time, the category or sub-plot that the event belongs to, the
location of the event, and miscellaneous notes. |
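The two spreadsheets map naturally onto two record types. The sketch below uses the columns listed in the captions; the class and field names are hypothetical, and the one-to-five range for strength is taken from the rating scale the team describes for its network diagrams.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """One row of the entity-relations sheet."""
    object_a: str
    object_b: str
    strength: int        # subjective rating of the supporting evidence, 1 to 5
    description: str
    evidence_date: date
    source: str          # source file the evidence comes from

    def __post_init__(self):
        if not 1 <= self.strength <= 5:
            raise ValueError("strength must be between 1 and 5")


@dataclass(frozen=True)
class Event:
    """One row of the events sheet."""
    name: str
    start: datetime
    end: Optional[datetime]  # assumption: open-ended events leave this empty
    category: str            # category or sub-plot the event belongs to
    location: str
    notes: str = ""


# The strength value here is illustrative, not the team's actual rating.
ceo = Relation("Madhi Kim", "Global Ways", 5, "CEO of",
               date(2004, 3, 13), "Week-of-Mon-20040308.txt_109")
print(ceo)
```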
During this
stage, we used Excel to create some experimental visualizations
in order to get a preliminary sense of the overarching story. For instance, by
sorting the import permits by category and applying conditional formatting, we
noticed that Abu Hassan and
Figure 11a. Conditional formatting reveals that Abu Hassan
and Global Ways trade exclusively with each other. |
Figure 11b. An experimental visualization of Abu Hassan’s import permits across time. An unusual peak in the September 2004 permit is evident. |
Figure
11c. An experimental visualization that shows location of characters across
time. Time and location correspond to the X- and the Y-axis, respectively.
Every colored point indicates the known location of a character at a given
time. Rapper r’Bear, depicted in orange, is often in Shravaana, but attends
Luella Vedric’s 8th Annual SPOMA dinner in |
In this stage,
we converted the pooled information into various diagrams, in an attempt to
make sense of what may be happening. In particular, several network diagrams
were created to visualize the relations between entities. Emphasis was placed
on finding patterns and coherent connections in the data.
GraphViz was
used to generate network diagrams automatically, using the Google Spreadsheets
database created in the previous stage [Fig.12].
The colors of the links in the generated network diagram represented the
strength of the connections between entities. The strength was based on our own
subjective ratings on a scale from one to five. In Figure 12, we observe that
the relations entered into our database were mostly supported by strong
evidence, and were rarely speculative. The diagram also enabled us to
distinguish between uni- and bi-directional
relations, and if unidirectional, helped us determine the direction. Finally, the diagram clustered related
entities in close proximity, enabling a rapid visual assessment of whether two
entities had any relation. The
clustering was also useful in determining the degrees of separation between
entities.
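One way to go from tab-delimited relations to a GraphViz diagram is to emit DOT text and hand it to the `dot` program. The sketch below assumes a four-column layout (object A, object B, strength, and a yes/no flag for mutual relations); that layout, the color palette, and the function names are hypothetical and are not the team's actual spreadsheet export.

```python
import csv
import io

# Hypothetical palette: weak evidence in gray, strong evidence in red.
PALETTE = {1: "gray80", 2: "gray60", 3: "gold", 4: "orange", 5: "red"}


def quote(label: str) -> str:
    """Wrap a node label in double quotes, escaping backslashes and quotes."""
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def relations_to_dot(tsv_text: str) -> str:
    """Convert rows of 'A<TAB>B<TAB>strength<TAB>mutual' into a DOT digraph."""
    lines = ["digraph relations {"]
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        a, b, strength, mutual = row
        attrs = [f"color={PALETTE[int(strength)]}"]
        if mutual.strip().lower() == "yes":
            attrs.append("dir=both")  # draw arrowheads at both ends
        lines.append(f"  {quote(a)} -> {quote(b)} [{', '.join(attrs)}];")
    lines.append("}")
    return "\n".join(lines)


# Illustrative rows; the strength ratings are made up.
rows = "Madhi Kim\tGlobal Ways\t5\tno\nMadhi Kim\tr'Bear\t3\tyes\n"
print(relations_to_dot(rows))
```

Saving the output as, say, `relations.dot` and running `dot -Tpng relations.dot -o relations.png` would render it; the layout engine then places connected entities near each other, which is what makes the clustering visible.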
In order to understand
temporal relationships between events, we used the trial version of Timeline
Maker [Fig.13]. One of its most useful features was the ability to color-code events
by category, allowing us to easily understand the unfolding of related events.
CmapTools was another tool used to generate relationship diagrams automatically from a database [Fig.14a]; it also allows nodes to be re-positioned manually. We also used it to create a map of motivations, which fueled hypothesis generation by outlining the critical facts and the inferences we could draw from them [Fig.14b]. This diagram served as a basis for analyzing the loose ends of our stories, and was integral to presenting competing hypotheses.
Figure 14a. Diagram of entity relations generated with CmapTools. |
Figure 14b. Motivations map created using CmapTools. |
During this
stage, we continued to use the cluster of sticky notes [Fig.8a,b], as well as the paper timeline [Fig.9a,b]. Although they were “low tech,” these
representations were powerful in facilitating communication between different
analysts, providing a personable space where information was exchanged through
speech, pointing, gestures, and even body language.
Even though we
shared a common database and set of visualizations, we found that forming a
coherent and unified hypothesis was not a simple process. Many scenarios were
plausible, each supported in some way by the evidence we had gathered.
We tackled this
issue of competing hypotheses in two ways. First, we held a day of hypothesis
presentations. We took turns explaining our personal views on the solution to
the contest problem, highlighting both the supporting evidence and holes in our
reasoning. Second, we attempted an analysis of competing hypotheses, loosely
following the model proposed by Richard J. Heuer, Jr.
[Fig.15]. Through this exercise, we were able to gauge the dependency of each
hypothesis on different pieces of information.
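The core bookkeeping of an analysis of competing hypotheses can be sketched as a small matrix. This is a simplified illustration, not the team's actual worksheet: the three-letter rating scheme (C for consistent, I for inconsistent, N for not applicable), the function names, and the placeholder labels H1, H2, E1 to E3 are all assumptions.

```python
def inconsistency_scores(matrix):
    """Count, for each hypothesis, the evidence items rated inconsistent ('I').

    matrix: {evidence: {hypothesis: 'C' | 'I' | 'N'}}
    A lower score means less evidence argues against the hypothesis.
    """
    scores = {}
    for ratings in matrix.values():
        for hypothesis, rating in ratings.items():
            scores[hypothesis] = scores.get(hypothesis, 0) + (rating == "I")
    return scores


def diagnostic_evidence(matrix):
    """Return evidence items whose rating differs between hypotheses.

    Evidence rated the same under every hypothesis cannot help choose
    between them, however compelling it looks.
    """
    return [ev for ev, ratings in matrix.items() if len(set(ratings.values())) > 1]


# Placeholder labels; no real contest evidence is encoded here.
matrix = {
    "E1": {"H1": "C", "H2": "C"},
    "E2": {"H1": "C", "H2": "I"},
    "E3": {"H1": "I", "H2": "I"},
}
print(inconsistency_scores(matrix))
print(diagnostic_evidence(matrix))
```

The second function speaks to dependency: a hypothesis that separates from its rivals on only one or two diagnostic items depends heavily on those items being correct.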
The map of
motivations created during the schematization stage continued to expand in this
stage [Fig.14b], and helped identify the emergence of loose ends and unanswered
questions. From this diagram, we were then able to individually refine our
hypotheses, avoiding explanatory gaps or unsupportable jumps in causality.
The final step
was generating a “skeleton hypothesis” that included only facts and inferences
we felt confident about, leaving out weak links and speculations. The wilder
stories of global intrigue, while excellent at explaining the motivations of
characters, lacked the strength of concrete evidence. The skeleton hypothesis
was later used as a template for the final answer we submitted.
Collaboration
and visualization software discussed in the previous stages of analysis were
re-visited during the decision-making process. Visualizations created using CmapTools, for instance, were modified and transformed to
focus on different aspects of different groups of characters. Much
of the decision-making near the end of the analysis happened face-to-face, and
previously computer-based visualizations facilitated the consensus-making
process that ultimately led to our proposed solution.